ABSTRACT


In recent years, emotional speech synthesis has advanced
significantly, driven by deep learning-based models and
vocoders. This paper reports experiments on two key
text-to-speech components, namely text-to-speech models
(Tacotron 2) and vocoders (HiFi-GAN), using a custom-recorded
dataset of emotional speech samples in Indian accents across
five emotional states (happy, sad, neutral, angry and excited),
with the aim of improving emotional speech synthesis quality.
We first employ a Tacotron 2 model with the WaveGlow vocoder
and subsequently introduce the HiFi-GAN vocoder. Throughout
the study, we experiment with hyperparameters, activation
functions and kernel sizes to optimize HiFi-GAN’s performance
on emotional data. This work contributes to emotional speech
synthesis in Indian accents and offers insights into the
effectiveness of different vocoder configurations.


Index Terms— Text-to-speech, emotional speech synthesis,
vocoders, Tacotron 2, HiFi-GAN


1. INTRODUCTION 


Speech is the primary form of communication between humans.
We express ourselves through speech, sign language and
gestures, and an interaction feels natural when it carries a
human touch. Today, people use electronic devices daily for
entertainment, work, communication and more. End-to-end
text-to-speech (TTS) systems have played a major role in
narrowing the gap between users and their devices: they offer
an interactive way for humans and devices to communicate and
have many applications. TTS is a technology that generates
audio from text input. Emotional TTS is a branch of TTS that
adds emotion to the generated speech, giving it a human touch
similar to how people express emotions when they communicate.
The tone of conventional TTS output is usually constant and
emotionless. Emotion can be added by changing the tempo, pitch
and amplitude, but the resulting voice sounds more robotic
than human. With many architectures available, this paper
discusses experiments and results from training
state-of-the-art models, namely Tacotron 2 [3], WaveGlow and
HiFi-GAN, to generate emotional speech from a low-resource
dataset. We target five emotions: happy, sad, neutral, angry
and excited. Creating speech involves not only considering
linguistic and prosodic aspects but also having a deep
understanding of the complex degrees of human emotion. The
synthesized audio samples are available at

://abhigyanbasak248. github.io/results.github.io/icassp.ht 


2. RELATED WORKS 


O. Kwon et al. [6] proposed an approach to emotional speech
synthesis that incorporates global style tokens (GSTs) into
the Tacotron 2 framework. The system consists of a style
architecture for emotion-related style embedding and
Tacotron 2 for text-to-speech synthesis. The style
architecture extracts style embedding vectors from emotional
speech references, while the Tacotron 2 encoder generates
hidden linguistic feature vectors from the input text
sequences. The two are combined and processed by the
Tacotron 2 decoder to produce mel-spectrograms, which the
WaveNet vocoder uses to synthesize the emotional speech
waveform. A Korean speech dataset covering happy, angry, sad
and neutral emotions was used as the corpus. With a 95%
confidence interval, the emotional evaluation results for
happy (3.51 ± 0.28), sad (3.26 ± 0.17) and angry (3.58 ± 0.16)
demonstrated the model’s ability to generate emotional speech.
To make TTS more expressive, Y. Feng et al. [8] proposed
acoustic condition modeling and sentiment analysis models
based on FastSpeech 2 [7]. Two acoustic encoders extract
utterance-level and phoneme-level vectors from the target
speech, and a sentiment analysis model derives objective
sentiment features from the text, which are then expanded and
fused with the acoustic model’s output vector. S. K. Nithin
et al. [9] evaluated two neural TTS models, Tacotron 2 and
FastSpeech 2, for synthesizing emotional speech through
transfer learning. Both models are first trained on the
LJSpeech dataset and then fine-tuned using emotional speech
data from ESD. Emotion-specific models are used to generate
speech samples, achieving a classification accuracy of 79% for
FastSpeech 2 and 90% for Tacotron 2, as assessed by the ScSer
speech emotion recognition model [10].


3. DATASET 


In this study, we built an emotional speech dataset tailored
to Indian accents. The dataset covers five emotions:
happiness, sadness, anger, neutrality and excitement, with an
average audio duration of 4-5 seconds per sample. It serves as
a foundational resource for advancing emotional speech
synthesis and for understanding emotional expression in
Indian-accented speech, setting the stage for more nuanced and
culturally relevant speech synthesis systems. The number of
samples per emotion is given in Table 1.


Table 1. Dataset Details 


Emotion Samples 
Happy 1500 
Sad 1600 
Neutral 1800 
Excitement 1800 
Angry 500 


4. EXPERIMENTS 


4.1. Tacotron 2 


Tacotron 2 is a fully neural TTS system that uses a
sequence-to-sequence recurrent network with attention to
predict mel-spectrograms. It works in two steps: first, a text
encoder converts the input text into a sequence of numerical
embeddings; next, a recurrent decoder generates a
mel-spectrogram, which represents audio frequencies over time.
This spectrogram is then converted into audible speech by a
vocoder. The vocoder used here is WaveGlow.

We extended Tacotron 2 to improve the quality and
expressiveness of the generated speech. Specifically, we
incorporated a dedicated prosody prediction network to capture
prosodic information within the mel-spectrograms, thereby
improving the intonation and rhythm of the synthesized speech.

We also experimented with several Tacotron 2 hyperparameters
to fine-tune the model on our dataset, using the mean opinion
score (MOS) as the metric. The experiments on the learning
rate and gate threshold are given in Table 2.


Table 2. Experimentation with Learning Rate and Gate
Threshold of Tacotron 2


Learning Rate   Gate Threshold   MOS
1e-4            0.5              1.4
1e-5            0.9              2.2
1e-5            0.7              2.4
1e-5            0.5              2.7
1e-5            0.3              2.5
1e-5            0.1              2.1
1e-6            0.5              3
1e-6            0.4              2.4
1e-6            0.3              2.3


Our first set of experiments showed that a learning rate of
1e-6 and a gate threshold of 0.5 gave the best results.
Keeping the gate threshold at 0.5, we then experimented with
weight decay and dropout; the results are listed in Table 3.


Table 3. Experimentation with Weight Decay and Dropout
of Tacotron 2


Learning Rate   Weight Decay   Dropout   MOS
1e-4            1e-6           0         1.2
1e-4            1e-6           0.1       1.4
1e-5            1e-5           0.1       2.6
1e-5            1e-6           0.1       2.7
1e-5            1e-7           0.1       2.4
1e-6            1e-5           0         2.3
1e-6            1e-6           0.1       3.0


These experiments further showed that a weight decay of 1e-6
and a dropout of 0.1 gave the best MOS.
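
To make the role of the gate threshold in Table 2 concrete, the following is a minimal sketch of how a Tacotron 2 style decoder is commonly stopped at inference: at each step the decoder emits a stop-token logit, and decoding ends once the sigmoid of that logit exceeds the threshold. The function name `decode_length`, the `max_decoder_steps` cap and the logit values are illustrative assumptions and are not taken from the experiments above.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_length(gate_logits, gate_threshold=0.5, max_decoder_steps=1000):
    """Count the mel frames emitted before the stop condition fires.

    The frame at which the stop probability first exceeds the threshold
    is included in the count. If the threshold is never exceeded, decoding
    runs until the logits or the step cap are exhausted.
    """
    limited = gate_logits[:max_decoder_steps]
    for step, logit in enumerate(limited, start=1):
        if sigmoid(logit) > gate_threshold:
            return step
    return len(limited)


# Hypothetical stop-token logits that rise towards the end of an utterance.
logits = [-4.0, -3.0, -1.0, 0.5, 2.5, 4.0]
lengths = {t: decode_length(logits, t) for t in (0.1, 0.5, 0.9)}
print(lengths)
```

With these hypothetical logits, a lower threshold ends the utterance earlier and a higher threshold lets it run longer, which suggests why the threshold can affect the perceived quality of the output.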


4.2. WaveGlow


WaveGlow is a flow-based generative model for speech
synthesis. It takes a sequence of mel-spectrogram frames as
input and conditions on them to synthesize audio quickly and
with high quality. The WaveGlow model uses 12 coupling layers
and 12 invertible 1x1 convolutions. Glow had succeeded in
generating high-quality, realistic images with a simple
structure built on invertible 1x1 convolutions, and WaveGlow
adopts the same idea on the strength of those results. The
invertible 1x1 convolution allows easy computation of both the
forward and inverse transformations, which reduces
difficulties during training. Once trained, WaveGlow captures
the dependencies between mel-spectrograms and waveforms and
maps effectively from one to the other.
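
The property that makes the 1x1 convolution attractive in a flow model can be illustrated with a small sketch: it mixes channels with a single square matrix at every time step, so the inverse is the same operation with the inverse matrix, and its contribution to the log-likelihood is a log-determinant term. The two-channel weight matrix, the toy signal and the helper names below are illustrative assumptions; this is not the WaveGlow implementation.

```python
import math


def conv1x1(weight, signal):
    """Apply a 1x1 convolution: mix channels independently at each time step.

    `weight` is a C x C matrix and `signal` is a list of C channels,
    each a list of T samples.
    """
    channels = len(weight)
    steps = len(signal[0])
    return [
        [sum(weight[o][i] * signal[i][t] for i in range(channels))
         for t in range(steps)]
        for o in range(channels)
    ]


def determinant_2x2(w):
    return w[0][0] * w[1][1] - w[0][1] * w[1][0]


def inverse_2x2(w):
    """Closed-form inverse of a 2 x 2 matrix (requires a non-zero determinant)."""
    det = determinant_2x2(w)
    return [[w[1][1] / det, -w[0][1] / det],
            [-w[1][0] / det, w[0][0] / det]]


def log_det_term(w, steps):
    """Log-determinant contribution of the layer, summed over time steps."""
    return steps * math.log(abs(determinant_2x2(w)))


W = [[2.0, 1.0], [1.0, 3.0]]                  # toy invertible weights, det = 5
x = [[0.1, -0.2, 0.3], [0.5, 0.4, -0.6]]      # 2 channels, 3 time steps

z = conv1x1(W, x)                      # forward pass (audio -> latent)
x_rec = conv1x1(inverse_2x2(W), z)     # inverse pass (latent -> audio)
print(x_rec, log_det_term(W, steps=3))
```

The recovered signal matches the input up to floating-point error, which is the sense in which both directions of the transformation are cheap to compute.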

We used a pretrained WaveGlow model, trained on the LJ Speech
dataset, as the vocoder in our Tacotron 2 experiments. Text
prompts were fed into our custom-trained Tacotron 2 to
generate mel-spectrograms, which WaveGlow then converted into
audio waveforms. A denoising step after WaveGlow inference was
applied to clean the audio output.


4.3. HiFi-GAN 


HiFi-GAN is a speech synthesis model that efficiently produces
high-fidelity speech waveforms from mel-spectrograms. Its
generator uses a multi-receptive field fusion (MRF) module,
which enables it to capture diverse patterns in audio data.
The model also introduces a multi-period discriminator (MPD)
that specifically handles sinusoidal signals with various
periods, a crucial factor in generating realistic speech
audio. To improve training stability and audio quality,
HiFi-GAN additionally uses a mel-spectrogram loss.
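
The idea behind the multi-period discriminator can be sketched as follows: the one-dimensional waveform is folded into a two-dimensional grid whose row length equals the period, so that each column holds samples spaced exactly one period apart. The helper name `reshape_for_period`, the toy signal and the choice of reflect padding are assumptions made for illustration; this is a sketch of the reshaping step only, not of the discriminator network.

```python
def reshape_for_period(samples, period):
    """Fold a 1D signal into rows of length `period`.

    If the length is not a multiple of `period`, the signal is padded on
    the right by mirroring it around its last sample (reflect padding).
    Assumes the signal is longer than the amount of padding required.
    """
    padded = list(samples)
    remainder = len(samples) % period
    if remainder:
        n_pad = period - remainder
        padded += [samples[-2 - i] for i in range(n_pad)]
    return [padded[i:i + period] for i in range(0, len(padded), period)]


signal = list(range(10))             # toy waveform: sample index as value
grid = reshape_for_period(signal, period=3)
column0 = [row[0] for row in grid]   # samples spaced one period apart

print(grid)
print(column0)
```

Each sub-discriminator would then see a different grid (for example, one per prime period), so periodic structure at different spacings lines up along the columns.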


4.3.1. Kernel Experiments 


We started with kernel experiments aimed at optimising the
resblock and upsample kernels of the model. The original
resblock kernel configuration, (3,7,11), produced good output.
However, reordering these sizes while keeping the same values
did not give desirable results: the model failed to generate
the expected output. Other configurations, such as (7,5,3),
(11,7,3), (11,5,7), (5,7,11) and (5,7,3), showed the same
issue, indicating that the model is sensitive to both the
values and the order of the resblock kernel sizes.
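
To illustrate what the resblock kernel sizes control, the sketch below computes the receptive field of a stack of dilated 1D convolutions for each kernel size in the original configuration, together with the padding that keeps the sequence length unchanged. The dilation pattern (1, 3, 5) and the simplification of each resblock to three dilated convolutions are assumptions made for illustration; they are not taken from our configuration files.

```python
def same_padding(kernel_size, dilation=1):
    """Padding on each side that preserves length for an odd kernel size."""
    return (kernel_size * dilation - dilation) // 2


def output_length(length, kernel_size, dilation=1):
    """Output length of a stride-1 convolution using `same_padding`."""
    padding = same_padding(kernel_size, dilation)
    return length + 2 * padding - dilation * (kernel_size - 1)


def receptive_field(kernel_size, dilations=(1, 3, 5)):
    """Number of input samples seen by one output sample after the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Receptive field per resblock for the kernel sizes (3, 7, 11).
fields = {k: receptive_field(k) for k in (3, 7, 11)}
print(fields)
print(output_length(100, kernel_size=7, dilation=3))
```

Under these assumptions the three resblocks cover short, medium and long contexts, which is the sense in which the module fuses multiple receptive fields.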

In contrast, modifications to the upsample kernel sizes did
not significantly affect the model’s performance. The original
upsample kernel configuration, (4,4,16,16), generated the
desired outputs, and alternatives such as (16,16,4,4) and
(4,16,16,16) led to neither a substantial improvement nor a
substantial deterioration. Overall, our kernel experiments
highlight the importance of selecting appropriate resblock
kernel sizes, while suggesting that the model is relatively
robust to the upsample kernel configuration.


4.3.2. Activation Function Experiments 


In this section, we present the results of our experiments
with the activation functions in the hidden and output layers.
For the output layer, we tried Sigmoid, SiLU, Softsign and
Leaky ReLU. None of them performed as well as the original
Tanh activation function. The results are listed in Table 4
and the loss curves are shown in Figures 1 and 2.


Table 4. Output Layer Experimentation on HiFi-GAN 


Activation Function MOS 
Tanh 4.35 
Sigmoid 4 
SiLU 3.9 
Softsign 2 
Leaky ReLU 2.5 
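
One property that differs between the output activations in Table 4 is the range of values they can produce, which matters when the output is interpreted as a waveform normalised to [-1, 1]. The sketch below evaluates each function on a grid of inputs and reports the minimum and maximum output. The input grid and the Leaky ReLU slope of 0.1 are illustrative assumptions, and the comparison is not offered as an explanation verified by our experiments.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    return x * sigmoid(x)


def softsign(x):
    return x / (1.0 + abs(x))


def leaky_relu(x, slope=0.1):
    # The slope value is an assumption made for this illustration.
    return x if x >= 0 else slope * x


OUTPUT_ACTIVATIONS = {
    "Tanh": math.tanh,
    "Sigmoid": sigmoid,
    "SiLU": silu,
    "Softsign": softsign,
    "Leaky ReLU": leaky_relu,
}


def output_range(fn, lo=-10.0, hi=10.0, steps=2001):
    """Minimum and maximum of `fn` on an evenly spaced grid over [lo, hi]."""
    values = [fn(lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]
    return min(values), max(values)


ranges = {name: output_range(fn) for name, fn in OUTPUT_ACTIVATIONS.items()}
for name, (low, high) in ranges.items():
    print(f"{name:<11} min={low:+.3f} max={high:+.3f}")
```

Tanh and Softsign are bounded and symmetric around zero, Sigmoid produces only positive values, and SiLU and Leaky ReLU are unbounded above. Since Softsign shares the range of Tanh yet scored lower, output range alone does not account for the ranking in Table 4.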


For the hidden layers, we replaced the original Leaky ReLU
with ELU, CELU, SELU, GELU, HardTanh and Softplus. None of
these alternatives gave better results than Leaky ReLU. The
results are listed in Table 5 and the loss curves are shown in
Figures 3 and 4.


Table 5. Hidden Layer Experimentation on HiFi-GAN 


Activation Function MOS 
Leaky ReLU 4.16 
ELU 2.5 
CELU 3.1 
SELU 2.2 
GELU 2.9 
HardTanh 1.8
Softplus 1.2


5. CONCLUSION 


In conclusion, our work on emotional speech synthesis with
Indian accents has produced several findings. We created a
dataset with five emotions and trained the Tacotron 2 model
with the WaveGlow vocoder for Indian-accented emotional speech
synthesis. We then integrated the HiFi-GAN vocoder into our
pipeline and ran experiments on activation functions and
kernel sizes, with varying results. This work contributes to
emotional speech synthesis for low-resource datasets and
underscores the importance of adapting these technologies to
diverse linguistic and cultural contexts. Our dataset, methods
and experimental insights offer resources for future research
and move us closer to emotionally expressive, culturally
nuanced and natural speech synthesis for Indian accents and
beyond, bridging the gap between artificial and human speech.